Hughes, Jacob and Brailsford, David F. and Bagley, Steven R. and Adams, Clive E. (2014) Generating summary documents for a variable-quality PDF document collection. In: ACM Symposium on Document
نویسندگان
چکیده
The Cochrane Schizophrenia Group’s Register of studies details all aspects of the effects of treating people with schizophrenia. It has been gathered over the last 20 years and consists of around 20,000 documents, overwhelmingly in PDF. Document collections of this sort – on a given theme but gathered from a wide range of sources – will generally have huge variability in the quality of the PDF, particularly with respect to the key property of text searchability. Summarising the results from the best of these papers, to allow evidence-based health care decision making, has so far been done by manually creating a summary document, starting from a visual inspection of the relevant PDF file. This labour-intensive process has resulted, to date, in only 4,000 of the papers being summarised – with enormous duplication of effort and with many issues around the validity and reliability of the data extraction. This paper describes a pilot project to provide a computer-assisted framework in which any of the PDF documents could be searched for the occurrence of some 8,000 keywords and key phrases. Once keyword tagging has been completed the framework assists in the generation of a standard summary document, thereby greatly speeding up the production of these summaries. Early examples of the framework are described and its capabilities illustrated.
منابع مشابه
Macdonald, Alexander J. and Brailsford, David F. and Bagley, Steven R. (2005) Encapsulating and Manipulating Component Object Graphics (COGs) using SVG. In: ACM Symposium on Document Engineering
Scalable Vector Graphics (SVG) has an imaging model similar to that of PostScript and PDF but the XML basis of SVG allows it to participate fully, via namespaces, in generalised XML documents. There is increasing interest in using SVG as a Page Description Language and we examine ways in which SVG document components can be encapsulated in contexts where SVG will be used as a rendering technolo...
متن کاملTracking Sub-page Components within Document Workflows
Documents go through numerous transformations and intermediate formats as they are processed from abstract markup into final printable form. This notion of a document workflow is well established but it is common to find that ideas about document components, which might exist in the source code for the document, become completely lost within an amorphous, unstructured, page of PDF prior to bein...
متن کاملKnight, Ian A. and Brailsford, David F. (2016) Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text. In: DocEng '16 Proceedings of the 2016 ACM Symposium on Document
The search accuracy achieved in a PDF image-plus-hiddentext (PDF-IT) document depends upon the accuracy of the optical character recognition (OCR) process that produced the searchable hidden text layer. In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine. This paper describes a project to replace an inadequate hidden textual layer o...
متن کاملA survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
متن کاملCreating Structured PDF Files
This paper describes a tool for recombining the logical structure from an XML document with the typeset appearance of the corresponding PDF document. The tool uses the XML representation as a template for the insertion of the logical structure into the existing PDF document, thereby creating a Structured/Tagged PDF. The addition of logical structure adds value to the PDF in three ways: the acce...
متن کامل